{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Extraction\n",
    "\n",
    "## Goals\n",
    "1. Introduction to Feature Extraction\n",
    "2. Do Feature Extraction on Text - Introduction to Bag of Words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "Machine Learning algorithms all take the same basic form of input: a fixed length list of numbers.  Very few real world problems are a fixed length list of numbers, so a crucial step in machine learning is converting the data into this format.  Sometimes the individual numbers are called \"features\" so this process is sometimes called \"feature extraction\".  A fixed length list of numbers is also known as a vector.  A list of vectors with all the same length is known as a matrix.\n",
    "\n",
    "Right now our input is a list of tweets that looks like this:\n",
    "\n",
    "<img src=\"images/tweets.png\" width=\"400\"/>\n",
    "\n",
    "We need to convert it into a list of feature vectors that look like this:\n",
    "\n",
    "<img src=\"images/features.png\" width=\"400\"/>\n",
    "\n",
    "You might want to stop and think about how you might do this.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bag of Words\n",
    "\n",
    "One way to convert our input into a vector is to make each row correspond to a different word and each cell correspond to the number of times that word occured in a particular tweet.\n",
    "\n",
    "<img src=\"images/tweet-transform.png\" width=\"600\"/>\n",
    "\n",
    "This creates a lot of columns!  This is the most basic feature extraction method used on text in natural language processing.\n",
    "\n",
    "Scikit has methods to make this transofrmation really easy.  In [sklearn.feature_extraction.text](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) there is a class called CountVectorizer we will use.\n",
    "\n",
    "CountVecotizer has two important methods\n",
    "1. *fit* sets things up, associating each word with a column\n",
    "2. *transform* converts a list of strings into feature vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
       "        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
       "        lowercase=True, max_df=1.0, max_features=None, min_df=1,\n",
       "        ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
       "        strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
       "        tokenizer=None, vocabulary=None)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "df = pd.read_csv('../scikit/tweets.csv')\n",
    "target = df['is_there_an_emotion_directed_at_a_brand_or_product']\n",
    "text = df['tweet_text']\n",
    "\n",
    "# We need to remove the empty rows from the text before we pass into CountVectorizer\n",
    "fixed_text = text[pd.notnull(text)]\n",
    "fixed_target = target[pd.notnull(text)]\n",
    "\n",
    "# Do the feature extraction\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "count_vect = CountVectorizer()  # initialize the count vectorizer\n",
    "count_vect.fit(fixed_text)      # set up the columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now our count_vect object is able to transform text into feautre vectors.\n",
    "\n",
    "We can try it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<1x9706 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 4 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_vect.transform([\"My iphone is awesome\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A *sparse* matrix is a matrix with mostly zeros, and we are definitely dealing with a sparse matric since most of the counts here are zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  (0, 876)\t1\n",
      "  (0, 4573)\t1\n",
      "  (0, 4596)\t1\n",
      "  (0, 5699)\t1\n"
     ]
    }
   ],
   "source": [
    "print(count_vect.transform([\"My iphone is awesome\"]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notation says that cells in columns 876, 4573, 4596, 5699 are one and all other cells are zero.  We have one row here because we passed in a list of length one - just the tweet \"My iphone is awesome\".\n",
    "\n",
    "Some questions to ask yourself now:\n",
    "- which words correspond to which columns?\n",
    "- is our transformation case senstitive?\n",
    "- how many columns do we have?\n",
    "\n",
    "Let's do the transformation on all of our tweets to build our big feature matrix (you can think of a matrix as a list of fixed size vectors)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(9092, 9706)\n"
     ]
    }
   ],
   "source": [
    "counts = count_vect.transform(fixed_text)\n",
    "print(counts.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great!  Now we have a feature matrix that we can feed in to our machine learning algorithm.  It has 9092 rows corresponding to 9092 tweets and 9706 columns corresponding to 9706 words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Takeaways\n",
    "\n",
    "1. All machine learning algorithms have the same API - a list of fixed-length vectors of numbers - also known as a Feature Matrix\n",
    "2. Data almost never comes in a list of fixed length vectors, so this transofrmation is critical, and highly application dependant.\n",
    "3. When dealing with text data, \"bag of words\" is a common way to do feature extraction.\n",
    "\n",
    "## Questions\n",
    "\n",
    "1. What would be another way to transform text?\n",
    "2. What information is lost in the \"bag-of-words\" transformation?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
